I initially found out about this course thanks to a good friend and colleague. The subjects it covers interest me a lot. I believe that science should be shared as much as possible; however, I lack some knowledge and experience in doing so. My most important contributions, the multicast and linear optimization extensions of NCorg DNC (previously the DiscoDNC tool), have been shared on my old GitHub profile. However, I believe that it was done in a limited manner. More specifically, I doubt that another researcher would be able to reproduce my experiments without a lot of effort on their part.
Furthermore, I believe that my data analysis has improved considerably over time. However, I have never had a focused lecture on how to analyse and share the results of experiments, so I am quite excited about the data science portion of this course.
Regarding the crash course on R: as I mentioned in my introduction (in Moodle), I used R during my master's studies. In fact, that was when I worked on the NCorg DNC tool mentioned above. Therefore, the crash course was not strictly necessary for me, as I already knew the basics. What was new to me was RStudio, and I am quite positively surprised by it. My initial experiences with R were negative, and most of my previous work was done in the Sublime Text editor, running scripts directly with Rscript. The last class and exercises made me see the potential of this programming environment, and I am looking forward to future exercises to see how this tool can help me.
This set consists of a few numbered exercises. Go to each exercise in turn and do as follows:
One or more extra packages (in addition to tidyverse)
will be needed below.
# Select (with mouse or arrow keys) the install.packages("...") and
# run it (by Ctrl+Enter / Cmd+Enter):
# install.packages("GGally")
The first step of data analysis with R is reading data into R. This is done with a function. Which function and function arguments to use to do this, depends on the original format of the data.
Conveniently in R, the same functions for reading data can usually be used whether the data is saved locally on your computer or somewhere else behind a web URL.
After the correct function has been identified and the data read into R,
the data will usually be in R's data.frame format. The
dimensions of a data frame are (\(n\),\(d\)), where \(n\) is the number of rows (the
observations) and \(d\) the number of
columns (the variables).
The purpose of this course is to expose you to some basic and more advanced tasks of programming and data analysis with R.
Read the lrn14 data frame into memory with read.table(). There is information related to the data here. Then:

- Use dim() on the data frame to look at the dimensions of the data. How many rows and columns does the data have?
- Look at the structure of the data with str().

Hint:
- For both functions you can pass a data frame as the first (unnamed) argument.
# This is a code chunk in RStudio editor.
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# read the data into memory
lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)
# Look at the dimensions of the data
dim(lrn14)
## [1] 183 60
# Look at the structure of the data
str(lrn14)
## 'data.frame': 183 obs. of 60 variables:
## $ Aa : int 3 2 4 4 3 4 4 3 2 3 ...
## $ Ab : int 1 2 1 2 2 2 1 1 1 2 ...
## $ Ac : int 2 2 1 3 2 1 2 2 2 1 ...
## $ Ad : int 1 2 1 2 1 1 2 1 1 1 ...
## $ Ae : int 1 1 1 1 2 1 1 1 1 1 ...
## $ Af : int 1 1 1 1 1 1 1 1 1 2 ...
## $ ST01 : int 4 4 3 3 4 4 5 4 4 4 ...
## $ SU02 : int 2 2 1 3 2 3 2 2 1 2 ...
## $ D03 : int 4 4 4 4 5 5 4 4 5 4 ...
## $ ST04 : int 4 4 4 4 3 4 2 5 5 4 ...
## $ SU05 : int 2 4 2 3 4 3 2 4 2 4 ...
## $ D06 : int 4 2 3 4 4 5 3 3 4 4 ...
## $ D07 : int 4 3 4 4 4 5 4 4 5 4 ...
## $ SU08 : int 3 4 1 2 3 4 4 2 4 2 ...
## $ ST09 : int 3 4 3 3 4 4 2 4 4 4 ...
## $ SU10 : int 2 1 1 1 2 1 1 2 1 2 ...
## $ D11 : int 3 4 4 3 4 5 5 3 4 4 ...
## $ ST12 : int 3 1 4 3 2 3 2 4 4 4 ...
## $ SU13 : int 3 3 2 2 3 1 1 2 1 2 ...
## $ D14 : int 4 2 4 4 4 5 5 4 4 4 ...
## $ D15 : int 3 3 2 3 3 4 2 2 3 4 ...
## $ SU16 : int 2 4 3 2 3 2 3 3 4 4 ...
## $ ST17 : int 3 4 3 3 4 3 4 3 4 4 ...
## $ SU18 : int 2 2 1 1 1 2 1 2 1 2 ...
## $ D19 : int 4 3 4 3 4 4 4 4 5 4 ...
## $ ST20 : int 2 1 3 3 3 3 1 4 4 2 ...
## $ SU21 : int 3 2 2 3 2 4 1 3 2 4 ...
## $ D22 : int 3 2 4 3 3 5 4 2 4 4 ...
## $ D23 : int 2 3 3 3 3 4 3 2 4 4 ...
## $ SU24 : int 2 4 3 2 4 2 2 4 2 4 ...
## $ ST25 : int 4 2 4 3 4 4 1 4 4 4 ...
## $ SU26 : int 4 4 4 2 3 2 1 4 4 4 ...
## $ D27 : int 4 2 3 3 3 5 4 4 5 4 ...
## $ ST28 : int 4 2 5 3 5 4 1 4 5 2 ...
## $ SU29 : int 3 3 2 3 3 2 1 2 1 2 ...
## $ D30 : int 4 3 4 4 3 5 4 3 4 4 ...
## $ D31 : int 4 4 3 4 4 5 4 4 5 4 ...
## $ SU32 : int 3 5 5 3 4 3 4 4 3 4 ...
## $ Ca : int 2 4 3 3 2 3 4 2 3 2 ...
## $ Cb : int 4 4 5 4 4 5 5 4 5 4 ...
## $ Cc : int 3 4 4 4 4 4 4 4 4 4 ...
## $ Cd : int 4 5 4 4 3 4 4 5 5 5 ...
## $ Ce : int 3 5 3 3 3 3 4 3 3 4 ...
## $ Cf : int 2 3 4 4 3 4 5 3 3 4 ...
## $ Cg : int 3 2 4 4 4 5 5 3 5 4 ...
## $ Ch : int 4 4 2 3 4 4 3 3 5 4 ...
## $ Da : int 3 4 1 2 3 3 2 2 4 1 ...
## $ Db : int 4 3 4 4 4 5 4 4 2 4 ...
## $ Dc : int 4 3 4 5 4 4 4 4 4 4 ...
## $ Dd : int 5 4 1 2 4 4 5 3 5 2 ...
## $ De : int 4 3 4 5 4 4 5 4 4 2 ...
## $ Df : int 2 2 1 1 2 3 1 1 4 1 ...
## $ Dg : int 4 3 3 5 5 4 4 4 5 1 ...
## $ Dh : int 3 3 1 4 5 3 4 1 4 1 ...
## $ Di : int 4 2 1 2 3 3 2 1 4 1 ...
## $ Dj : int 4 4 5 5 3 5 4 5 2 4 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ gender : chr "F" "M" "F" "M" ...
The next step is wrangling the data into a format that is easy to analyze. We will wrangle our data for the next few exercises.
A neat thing about R is that many operations are vectorized: a single operation can affect all elements of a vector. This is often convenient.
The column Attitude in lrn14 is a sum of 10
questions related to students attitude towards statistics, each measured
on the Likert
scale (1-5). Here we’ll scale the combination variable back to the
1-5 scale.
Create a new column attitude in the lrn14 data frame, where each observation of Attitude is scaled back to the original scale of the questions, by dividing it by the number of questions.

Hint:
- Assign 'Attitude divided by 10' to the new column 'attitude'.
# This is a code chunk in RStudio editor.
# Work with the exercise in this chunk, step-by-step. Fix the R code!
#lrn14 is available
# divide each number in a vector
c(1,2,3,4,5) / 2
## [1] 0.5 1.0 1.5 2.0 2.5
# print the "Attitude" column vector of the lrn14 data
lrn14$Attitude
## [1] 37 31 25 35 37 38 35 29 38 21 39 38 24 30 26 25 41 26 26 17 27 39 34 27 23
## [26] 37 44 41 24 37 25 30 34 32 20 24 42 16 31 38 38 33 17 25 32 35 32 42 31 39
## [51] 19 21 25 32 32 26 23 38 28 33 48 40 40 47 23 31 27 41 34 34 25 21 14 19 37
## [76] 41 32 28 41 25 28 38 31 35 36 26 44 45 32 20 39 25 33 35 33 30 29 33 33 35
## [101] 36 42 37 28 42 22 32 50 47 36 36 29 35 40 35 32 26 20 27 32 33 39 33 30 37
## [126] 14 30 25 29 39 36 29 21 31 24 40 31 23 28 37 26 24 30 29 32 28 29 24 31 19
## [151] 20 38 34 37 29 23 41 27 35 34 32 33 33 35 32 31 24 28 17 19 32 35 24 38 21
## [176] 29 19 20 42 41 37 36 18
# divide each number in the column vector
attitudeDiv <- lrn14$Attitude / 10
# create column 'attitude' by scaling the column "Attitude"
lrn14$attitude <- attitudeDiv
Our data includes many questions that can be thought to measure the same dimension. You can read more about the data and the variables here. Here we’ll combine multiple questions into combination variables. Useful functions for summation with data frames in R are
| function | description |
|---|---|
| colSums(df) | returns a sum of each column in df |
| rowSums(df) | returns a sum of each row in df |
| colMeans(df) | returns the mean of each column in df |
| rowMeans(df) | returns the mean of each row in df |
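As a quick illustration of these functions, here is a minimal sketch on a made-up two-column data frame:

```r
# Toy data frame to try the summary functions on
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

colSums(df)   # a: 6, b: 15
rowSums(df)   # 5 7 9
colMeans(df)  # a: 2, b: 5
rowMeans(df)  # 2.5 3.5 4.5
```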
We’ll combine the use of rowMeans() with the select() function from
the dplyr library to average the answers of selected questions. See how
it is done in the example code.
Create the combination variables in lrn14:

- Create column 'deep' in lrn14 by averaging the answers to the deep learning questions.
- Create column 'surf' in lrn14 by averaging the answers to the surface learning questions.
- Create column 'stra' in lrn14 by averaging the answers to the strategic learning questions.

Hints:
- Columns related to strategic learning are in the object strategic_questions. Use it for selecting the correct columns.
- Use the function rowMeans() identically to the examples.
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# lrn14 is available
# Access the dplyr library
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# questions related to deep, surface and strategic learning
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30","D06", "D15", "D23", "D31")
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")
# select the columns related to deep learning
deep_columns <- select(lrn14, one_of(deep_questions))
# and create column 'deep' by averaging
lrn14$deep <- rowMeans(deep_columns)
# select the columns related to surface learning
surface_columns <- select(lrn14, one_of(surface_questions))
# and create column 'surf' by averaging
lrn14$surf <- rowMeans(surface_columns)
# select the columns related to strategic learning
strategic_columns <- select(lrn14, one_of(strategic_questions))
# and create column 'stra' by averaging
lrn14$stra <- rowMeans(strategic_columns)
Often it is convenient to work with only a certain column or a subset
of columns of a bigger data frame. There are many ways to select columns
of data frame in R and you saw one of them in the previous exercise:
select() from dplyr.
dplyr is a popular library for data wrangling. There are also convenient data wrangling cheatsheets to help you get started (dplyr, tidyr etc.)
Create an object keep_columns with the names of the columns to keep. Then use select() (possibly together with one_of()) to create a new data frame learning2014 with the columns named in keep_columns.

Hint:
- See the previous exercise for help on how to select columns
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# lrn14 is available
# access the dplyr library
library(dplyr)
# choose a handful of columns to keep
keep_columns <- c("gender","Age","attitude", "deep", "stra", "surf", "Points")
# select the 'keep_columns' to create a new dataset
learning2014 <- lrn14[keep_columns]
# look at the dimensions of the new dataset
dim(learning2014)
## [1] 183 7
Sometimes you want to rename your columns. You could do this by
creating copies of the columns with new names, but you can also directly
get and set the column names of a data frame, using the function
colnames().
The dplyr library has a rename()
function, which can also be used. Remember the cheatsheets.
Change the name of the second column of learning2014 to 'age', and the name of 'Points' to 'points'.

Hint:
- You can use colnames() similarly to the example. Which index matches the column ‘Points’?
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available
# print out the column names of the data
colnames(learning2014)
## [1] "gender" "Age" "attitude" "deep" "stra" "surf" "Points"
# change the name of the second column
colnames(learning2014)[2] <- "age"
# change the name of "Points" to "points"
colnames(learning2014)[7] <- "points"
# print out the new column names of the data
colnames(learning2014)
## [1] "gender" "age" "attitude" "deep" "stra" "surf" "points"
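The same renaming can also be done with dplyr's rename(), mentioned above. A minimal sketch on a toy data frame standing in for the pre-rename learning2014 (the arguments take the form new_name = old_name):

```r
library(dplyr)

# Toy stand-in for learning2014 before renaming (not the real data)
d <- data.frame(gender = "F", Age = 21, Points = 25)

# rename() takes pairs of the form new_name = old_name
d <- rename(d, age = Age, points = Points)
colnames(d)  # "gender" "age" "points"
```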
Often your data includes outliers or other observations which you wish to remove before further analysis. Or perhaps you simply wish to work with some subset of your data.
In the learning2014 data the variable ‘points’ denotes the student's exam points in a statistics course exam. If the student did not attend the exam, the value of ‘points’ is zero. We will remove these observations from the data.
Create a new data frame male_students by selecting the male students from learning2014. Then override learning2014 by selecting the rows where the ‘points’ variable is greater than zero.

Hint:
- The “greater than” logical operator is >
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available
# access the dplyr library
library(dplyr)
# select male students
male_students <- filter(learning2014, gender == "M")
# select rows where points is greater than zero
learning2014 <- filter(learning2014, points > 0)
ggplot2 is a popular library for creating stunning graphics with R. It has some advantages over the basic plotting system in R, mainly consistent use of function arguments and flexible plot alteration. ggplot2 is an implementation of Leland Wilkinson’s Grammar of Graphics — a general scheme for data visualization.
In ggplot2, plots may be created via the convenience function
qplot() where arguments and defaults are meant to be
similar to base R’s plot() function. More complex plotting
capacity is available via ggplot(), which exposes the user
to more explicit elements of the grammar. (From Wikipedia.)
There is also a cheatsheet for data visualization with ggplot2.
Map gender to colour by adding col = gender inside aes(). Add a main title with ggtitle("<insert title here>") to the plot with the regression line.

Hints:
- Use + to add the title to the plot
- The plot with the regression line is saved in the object p3
- You can draw the plot by typing the object name where the plot is saved
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available
# Access the ggplot2 library
library(ggplot2)
# initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points, col=gender))
# define the visualization type (points)
p2 <- p1 + geom_point()
# draw the plot
p2
# add a regression line
p3 <- p2 + geom_smooth(method = "lm")
# add a main title
p4 <- p3 + ggtitle("Student's attitude versus exam points")
# draw the plot!
p4
## `geom_smooth()` using formula = 'y ~ x'
Often the most interesting feature of your data are the relationships between the variables. If there are only a handful of variables saved as columns in a data frame, it is possible to visualize all of these relationships neatly in a single plot.
Base R offers a fast plotting function pairs(), which
draws all possible scatter plots from the columns of a data frame,
resulting in a scatter plot matrix. Libraries GGally
and ggplot2 together offer a slow but more detailed
look at the variables, their distributions and relationships.
Draw a scatter plot matrix with pairs(), adding the argument col to the
pairs() function and defining the colour with the ‘gender’ variable in
learning2014. Then create a plot matrix p with ggpairs(), adjusting the
mapping of ggpairs() by defining col = gender inside aes(). Also add
alpha = 0.3 inside aes().

Hints:
- You can use $ to access a column of a data frame.
- Remember to separate function arguments with a comma
- You can draw the plot p by simply typing its name: just like printing R objects.
We see that on average the students were 25 years old, and most students seem to have a rather low attitude towards statistics. Given that the largest score in the exam was 33 points, we will assume that was the maximum score. If that is the case, most students seem to have done quite well, with the majority scoring somewhere in the twenties.
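These observations can be checked numerically with summary(). A minimal sketch on a hypothetical five-student sample (the real figures come from learning2014 itself):

```r
# Hypothetical mini-sample with the same column names (not the real data)
d <- data.frame(age    = c(20, 21, 25, 23, 35),
                points = c(19, 22, 33, 25, 28))

# Five-number summary plus mean; the mean of 'age' is the reported average
summary(d$age)
# The "Max." entry of 'points' is where the assumed top score shows up
summary(d$points)
```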
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available
# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])
# access the GGally and ggplot2 libraries
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
To facilitate the analysis, we plot the histograms of the different variables. We can see that the age distribution indeed leans to the left of the plot, meaning that the vast majority of people were in their twenties. Attitude is clearly centred, looking similar to a normal distribution. Similar behaviour can be seen in deep, stra and surf, although stra tends more towards a uniform distribution than the others. Finally, as mentioned before, we can see that the majority of people got a relatively good grade, around twenty-something points.
par(mfrow=c(2,3))
hist(learning2014$age,prob=T,main="Age")
hist(learning2014$attitude,prob=T,main="Attitude")
hist(learning2014$deep,prob=T,main="Deep")
hist(learning2014$stra,prob=T,main="Stra")
hist(learning2014$surf,prob=T,main="Surf")
hist(learning2014$points,prob=T,main="Points")
From this, we can already make a selection of the most interesting variables. Age does not seem to correlate with the points column, whereas attitude shows a strong correlation. It is followed by stra and then surf, which seems to have an inverse relation to our target variable. Therefore, we choose attitude, stra and surf for our model. We can double-check our choices by directly computing the correlation of each variable with the target. As the previous plots suggested, our selection has the strongest (inverse) associations.
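The correlation check can be done with cor(). A minimal sketch with made-up numbers mimicking the pattern described above (with the real data, pass the learning2014 columns instead):

```r
# Made-up sample, not the real learning2014 data
d <- data.frame(
  attitude = c(3.5, 2.0, 4.0, 2.5, 3.0),
  stra     = c(3.0, 2.5, 3.5, 2.0, 3.2),
  surf     = c(2.0, 3.5, 1.8, 3.0, 2.4),
  points   = c(28, 15, 31, 18, 25)
)

# Correlation of each candidate explanatory variable with the target
cor(d[, c("attitude", "stra", "surf")], d$points)
```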
Regression
analysis with R is easy once you have your data in a neat data
frame. You can simply use the lm() function to fit a linear
model. The first argument of lm() is a
formula, which defines the target variable and the
explanatory variable(s).
The formula should be y ~ x, where y is the
target (or outcome) variable and x the explanatory variable
(predictor). The second argument of lm() is
data, which should be a data frame where y and
x are columns.
The output of lm() is a linear model object, which can
be saved for later use. The generic function summary() can
be used to print out a summary of the model.
Hints:
- Replace 1 with the name of the explanatory variable in the formula inside lm()
- Use summary() on the model object to print out a summary
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available
# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
# fit a linear model
my_model <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
When there is more than one explanatory variable in the linear
model, it is called multiple regression. In R, it is easy to include
more than one explanatory variable in your linear model. This is done
by simply defining more explanatory variables in the
formula argument of lm(), as below

y ~ x1 + x2 + ..
Here y is again the target variable and
x1, x2, .. are the explanatory variables.
Draw a plot matrix of learning2014 with ggpairs(). Then fit a regression model where points is the target variable and both attitude and stra are the explanatory variables.

Hint:
- The variable with the third highest absolute correlation with points is surf.
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available
library(GGally)
library(ggplot2)
# create a plot matrix with ggpairs()
ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
# print out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
# create a regression model with multiple explanatory variables
my_model3 <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Looking at the summaries of the models, we can see that attitude indeed has the lowest p-value. We also see that the p-values of both stra and surf are quite large, meaning that our model would probably perform better if we used only attitude.
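One way to check that intuition is to compare adjusted R-squared values, or to run an F-test with anova(). A hedged sketch on simulated data where the extra predictor is pure noise (with the real data, fit the models on learning2014 instead):

```r
set.seed(1)
# Simulated data: points depends on attitude only; stra is pure noise
d <- data.frame(attitude = runif(50, 1, 5), stra = runif(50, 1, 5))
d$points <- 10 + 3.5 * d$attitude + rnorm(50, sd = 5)

m1 <- lm(points ~ attitude, data = d)
m2 <- lm(points ~ attitude + stra, data = d)

# Adjusted R-squared penalizes predictors that add nothing
summary(m1)$adj.r.squared
summary(m2)$adj.r.squared

# F-test for whether adding 'stra' improves the fit
anova(m1, m2)
```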
R makes it easy to graphically explore the validity of your model
assumptions. If you give a linear model object as the first argument to
the plot() function, the function automatically assumes you
want diagnostic plots and will produce them. You can check the help page
of plotting an lm object by typing ?plot.lm or
help(plot.lm) to the R console.
In the plot function you can then use the argument which
to choose which plots you want. which must be an integer
vector corresponding to the following list of plots:
| which | graphic |
|---|---|
| 1 | Residuals vs Fitted values |
| 2 | Normal QQ-plot |
| 3 | Standardized residuals vs Fitted values |
| 4 | Cook’s distances |
| 5 | Residuals vs Leverage |
| 6 | Cook’s distance vs Leverage |
We will focus on plots 1, 2 and 5: Residuals vs Fitted values,
Normal QQ-plot and Residuals vs Leverage.
Create the model my_model2 again and draw the diagnostic plots Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage with the plot() function, using the argument which. Before the call to the plot() function, add the following: par(mfrow = c(2,2)). This will place the following 4 graphics in the same plot. Execute the code again to see the effect.

Hint:
- You can combine integers into an integer vector with c(). For example: c(1,2,3).
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model2, which=c(1,2,5))
From these plots we can see that our model works well for points close to the mean. However, at both extremes the residuals are much larger.
Okay, so let’s assume that we have a linear model which seems to fit our standards. What can we do with it?
The model quantifies the relationship between the explanatory variable(s) and the dependent variable. The model can also be used for predicting the dependent variable based on new observations of the explanatory variable(s).
In R, predicting can be done using the predict()
function. (see ?predict). The first argument of predict is
a model object and the argument newdata (a data.frame) can
be used to make predictions based on new observations. One or more
columns of newdata should have the same name as the
explanatory variables in the model object.
Create a model object m and print out a summary of the model. Create a vector new_attitudes, make a new data frame with the column attitude from new_attitudes, and predict() the new students’ exam points based on their attitudes, using the newdata argument.

Hints:
- Type attitude = new_attitudes inside the data.frame() function.
- Give the new_data data.frame as the newdata argument for predict()
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available
# Create model object m
m <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(m)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
# New observations
new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(attitude = new_attitudes)
# Print out the new data
new_data
## attitude
## Mia 3.8
## Mike 4.4
## Riikka 2.2
## Pekka 2.9
# Predict the new students exam points based on attitude
predict(m, newdata = new_data)
## Mia Mike Riikka Pekka
## 25.03390 27.14918 19.39317 21.86099
Awesome work!
As before, I prefer reading the data from the URL, just in case I made some mistake in the previous step.
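Since the read step itself is not shown, here is a hedged sketch of it. The URL in the comment is a stand-in for wherever the wrangled alc data was actually saved; to keep the example runnable, it writes and reads back a tiny dummy file instead:

```r
# With the real data the call would look like
# alc <- read.csv("<url-or-path-to-alc-data>")

# Self-contained stand-in: write a two-row csv and read it back
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(sex = c("F", "M"), alc_use = c(1, 2.5)),
          tmp, row.names = FALSE)
alc <- read.csv(tmp)
str(alc)
```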
This plot gives us some insight on the variables of the dataset.
library(tidyr); library(dplyr); library(ggplot2)
# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free")
From the previous plots we select “sex”, “absences”, “reason” and “traveltime”. My initial reasons for selecting them are:
sex: the differences between males and females (I assume males drink more alcohol, or feel more inclined to report it).
absences: maybe we could see some people who like to party more, or just have a more carefree lifestyle.
reason: the reason for choosing the school could also reflect lifestyle.
traveltime: perhaps the travel time could point to a different wealth status, which I would think could also affect alcohol consumption.
# draw a bar plot for alc variable
alc %>% select(alc_use) %>% gather %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free")
# draw a bar plot of each variable
alc %>% select(sex, absences, reason, traveltime) %>% gather %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free")
For comparing the multiple variables I decided to use the CrossTable() function. The first variable we look at is sex. Both sexes have rather considerable chi-square contributions, meaning that this variable does seem interesting for our modelling purposes.
If the formatting is broken, I suggest opening the tables in a new window. This order was selected due to the high number of levels of some variables.
library(gmodels)
CrossTable(alc$sex, alc$alc_use)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 370
##
##
## | alc$alc_use
## alc$sex | 1 | 1.5 | 2 | 2.5 | 3 | 3.5 | 4 | 4.5 | 5 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## F | 87 | 41 | 26 | 25 | 11 | 3 | 1 | 0 | 1 | 195 |
## | 2.367 | 1.831 | 0.418 | 0.532 | 2.040 | 3.964 | 2.954 | 1.581 | 2.954 | |
## | 0.446 | 0.210 | 0.133 | 0.128 | 0.056 | 0.015 | 0.005 | 0.000 | 0.005 | 0.527 |
## | 0.621 | 0.651 | 0.464 | 0.610 | 0.344 | 0.176 | 0.111 | 0.000 | 0.111 | |
## | 0.235 | 0.111 | 0.070 | 0.068 | 0.030 | 0.008 | 0.003 | 0.000 | 0.003 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## M | 53 | 22 | 30 | 16 | 21 | 14 | 8 | 3 | 8 | 175 |
## | 2.638 | 2.040 | 0.466 | 0.593 | 2.273 | 4.417 | 3.292 | 1.762 | 3.292 | |
## | 0.303 | 0.126 | 0.171 | 0.091 | 0.120 | 0.080 | 0.046 | 0.017 | 0.046 | 0.473 |
## | 0.379 | 0.349 | 0.536 | 0.390 | 0.656 | 0.824 | 0.889 | 1.000 | 0.889 | |
## | 0.143 | 0.059 | 0.081 | 0.043 | 0.057 | 0.038 | 0.022 | 0.008 | 0.022 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 140 | 63 | 56 | 41 | 32 | 17 | 9 | 3 | 9 | 370 |
## | 0.378 | 0.170 | 0.151 | 0.111 | 0.086 | 0.046 | 0.024 | 0.008 | 0.024 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
A chi-square contribution of 0 would mean that the variables are completely independent. To my surprise, the highest consumption of alcohol seems to be more strongly associated with low absences. The variables do seem to have some relation, just not the one I expected.
library(gmodels)
CrossTable(alc$absences, alc$alc_use)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 370
##
##
## | alc$alc_use
## alc$absences | 1 | 1.5 | 2 | 2.5 | 3 | 3.5 | 4 | 4.5 | 5 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 31 | 10 | 9 | 6 | 4 | 3 | 0 | 0 | 0 | 63 |
## | 2.152 | 0.049 | 0.030 | 0.138 | 0.385 | 0.004 | 1.532 | 0.511 | 1.532 | |
## | 0.492 | 0.159 | 0.143 | 0.095 | 0.063 | 0.048 | 0.000 | 0.000 | 0.000 | 0.170 |
## | 0.221 | 0.159 | 0.161 | 0.146 | 0.125 | 0.176 | 0.000 | 0.000 | 0.000 | |
## | 0.084 | 0.027 | 0.024 | 0.016 | 0.011 | 0.008 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 18 | 10 | 9 | 5 | 6 | 2 | 0 | 0 | 0 | 50 |
## | 0.045 | 0.260 | 0.271 | 0.053 | 0.649 | 0.038 | 1.216 | 0.405 | 1.216 | |
## | 0.360 | 0.200 | 0.180 | 0.100 | 0.120 | 0.040 | 0.000 | 0.000 | 0.000 | 0.135 |
## | 0.129 | 0.159 | 0.161 | 0.122 | 0.188 | 0.118 | 0.000 | 0.000 | 0.000 | |
## | 0.049 | 0.027 | 0.024 | 0.014 | 0.016 | 0.005 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 24 | 8 | 8 | 3 | 6 | 2 | 3 | 0 | 2 | 56 |
## | 0.373 | 0.247 | 0.027 | 1.656 | 0.276 | 0.128 | 1.969 | 0.454 | 0.299 | |
## | 0.429 | 0.143 | 0.143 | 0.054 | 0.107 | 0.036 | 0.054 | 0.000 | 0.036 | 0.151 |
## | 0.171 | 0.127 | 0.143 | 0.073 | 0.188 | 0.118 | 0.333 | 0.000 | 0.222 | |
## | 0.065 | 0.022 | 0.022 | 0.008 | 0.016 | 0.005 | 0.008 | 0.000 | 0.005 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 21 | 6 | 4 | 4 | 1 | 0 | 1 | 0 | 1 | 38 |
## | 3.049 | 0.034 | 0.533 | 0.011 | 1.591 | 1.746 | 0.006 | 0.308 | 0.006 | |
## | 0.553 | 0.158 | 0.105 | 0.105 | 0.026 | 0.000 | 0.026 | 0.000 | 0.026 | 0.103 |
## | 0.150 | 0.095 | 0.071 | 0.098 | 0.031 | 0.000 | 0.111 | 0.000 | 0.111 | |
## | 0.057 | 0.016 | 0.011 | 0.011 | 0.003 | 0.000 | 0.003 | 0.000 | 0.003 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 14 | 5 | 5 | 5 | 2 | 3 | 0 | 0 | 1 | 35 |
## | 0.043 | 0.154 | 0.017 | 0.324 | 0.348 | 1.205 | 0.851 | 0.284 | 0.026 | |
## | 0.400 | 0.143 | 0.143 | 0.143 | 0.057 | 0.086 | 0.000 | 0.000 | 0.029 | 0.095 |
## | 0.100 | 0.079 | 0.089 | 0.122 | 0.062 | 0.176 | 0.000 | 0.000 | 0.111 | |
## | 0.038 | 0.014 | 0.014 | 0.014 | 0.005 | 0.008 | 0.000 | 0.000 | 0.003 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 4 | 7 | 5 | 2 | 2 | 1 | 0 | 0 | 1 | 22 |
## | 2.246 | 2.827 | 0.838 | 0.079 | 0.005 | 0.000 | 0.535 | 0.178 | 0.404 | |
## | 0.182 | 0.318 | 0.227 | 0.091 | 0.091 | 0.045 | 0.000 | 0.000 | 0.045 | 0.059 |
## | 0.029 | 0.111 | 0.089 | 0.049 | 0.062 | 0.059 | 0.000 | 0.000 | 0.111 | |
## | 0.011 | 0.019 | 0.014 | 0.005 | 0.005 | 0.003 | 0.000 | 0.000 | 0.003 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 6 | 10 | 3 | 3 | 3 | 1 | 0 | 1 | 0 | 0 | 21 |
## | 0.531 | 0.093 | 0.010 | 0.195 | 0.367 | 0.965 | 0.468 | 0.170 | 0.511 | |
## | 0.476 | 0.143 | 0.143 | 0.143 | 0.048 | 0.000 | 0.048 | 0.000 | 0.000 | 0.057 |
## | 0.071 | 0.048 | 0.054 | 0.073 | 0.031 | 0.000 | 0.111 | 0.000 | 0.000 | |
## | 0.027 | 0.008 | 0.008 | 0.008 | 0.003 | 0.000 | 0.003 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 7 | 3 | 2 | 4 | 1 | 0 | 1 | 1 | 0 | 0 | 12 |
## | 0.523 | 0.001 | 2.626 | 0.082 | 1.038 | 0.365 | 1.718 | 0.097 | 0.292 | |
## | 0.250 | 0.167 | 0.333 | 0.083 | 0.000 | 0.083 | 0.083 | 0.000 | 0.000 | 0.032 |
## | 0.021 | 0.032 | 0.071 | 0.024 | 0.000 | 0.059 | 0.111 | 0.000 | 0.000 | |
## | 0.008 | 0.005 | 0.011 | 0.003 | 0.000 | 0.003 | 0.003 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 8 | 6 | 4 | 4 | 1 | 2 | 1 | 1 | 1 | 0 | 20 |
## | 0.325 | 0.104 | 0.313 | 0.667 | 0.042 | 0.007 | 0.542 | 4.329 | 0.486 | |
## | 0.300 | 0.200 | 0.200 | 0.050 | 0.100 | 0.050 | 0.050 | 0.050 | 0.000 | 0.054 |
## | 0.043 | 0.063 | 0.071 | 0.024 | 0.062 | 0.059 | 0.111 | 0.333 | 0.000 | |
## | 0.016 | 0.011 | 0.011 | 0.003 | 0.005 | 0.003 | 0.003 | 0.003 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 9 | 2 | 2 | 1 | 3 | 1 | 1 | 0 | 0 | 1 | 11 |
## | 1.123 | 0.009 | 0.266 | 2.603 | 0.002 | 0.484 | 0.268 | 0.089 | 2.005 | |
## | 0.182 | 0.182 | 0.091 | 0.273 | 0.091 | 0.091 | 0.000 | 0.000 | 0.091 | 0.030 |
## | 0.014 | 0.032 | 0.018 | 0.073 | 0.031 | 0.059 | 0.000 | 0.000 | 0.111 | |
## | 0.005 | 0.005 | 0.003 | 0.008 | 0.003 | 0.003 | 0.000 | 0.000 | 0.003 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 10 | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 7 |
## | 0.689 | 1.192 | 0.003 | 0.776 | 0.605 | 1.431 | 0.170 | 0.057 | 4.043 | |
## | 0.571 | 0.000 | 0.143 | 0.000 | 0.000 | 0.143 | 0.000 | 0.000 | 0.143 | 0.019 |
## | 0.029 | 0.000 | 0.018 | 0.000 | 0.000 | 0.059 | 0.000 | 0.000 | 0.111 | |
## | 0.011 | 0.000 | 0.003 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.003 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 11 | 0 | 2 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 5 |
## | 1.892 | 1.550 | 0.757 | 3.774 | 0.745 | 0.230 | 0.122 | 0.041 | 0.122 | |
## | 0.000 | 0.400 | 0.000 | 0.400 | 0.200 | 0.000 | 0.000 | 0.000 | 0.000 | 0.014 |
## | 0.000 | 0.032 | 0.000 | 0.049 | 0.031 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.005 | 0.000 | 0.005 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 12 | 1 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 7 |
## | 1.026 | 1.192 | 0.835 | 0.776 | 3.213 | 0.322 | 19.662 | 0.057 | 0.170 | |
## | 0.143 | 0.000 | 0.286 | 0.000 | 0.286 | 0.000 | 0.286 | 0.000 | 0.000 | 0.019 |
## | 0.007 | 0.000 | 0.036 | 0.000 | 0.062 | 0.000 | 0.222 | 0.000 | 0.000 | |
## | 0.003 | 0.000 | 0.005 | 0.000 | 0.005 | 0.000 | 0.005 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 13 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 |
## | 0.757 | 1.277 | 0.303 | 0.222 | 0.173 | 0.092 | 0.049 | 0.016 | 18.604 | |
## | 0.000 | 0.500 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.500 | 0.005 |
## | 0.000 | 0.016 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.111 | |
## | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 14 | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 2 | 1 | 7 |
## | 2.649 | 1.192 | 0.003 | 6.378 | 0.605 | 0.322 | 0.170 | 66.533 | 4.043 | |
## | 0.000 | 0.000 | 0.143 | 0.429 | 0.000 | 0.000 | 0.000 | 0.286 | 0.143 | 0.019 |
## | 0.000 | 0.000 | 0.018 | 0.073 | 0.000 | 0.000 | 0.000 | 0.667 | 0.111 | |
## | 0.000 | 0.000 | 0.003 | 0.008 | 0.000 | 0.000 | 0.000 | 0.005 | 0.003 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 16 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.378 | 0.170 | 0.151 | 0.111 | 9.649 | 0.046 | 0.024 | 0.008 | 0.024 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.031 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 17 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
## | 0.378 | 0.170 | 0.151 | 7.135 | 0.086 | 0.046 | 0.024 | 0.008 | 0.024 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 |
## | 0.000 | 0.000 | 0.000 | 0.024 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 18 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 |
## | 0.757 | 1.277 | 0.303 | 0.222 | 0.173 | 8.974 | 0.049 | 0.016 | 0.049 | |
## | 0.000 | 0.500 | 0.000 | 0.000 | 0.000 | 0.500 | 0.000 | 0.000 | 0.000 | 0.005 |
## | 0.000 | 0.016 | 0.000 | 0.000 | 0.000 | 0.059 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 19 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.378 | 0.170 | 0.151 | 0.111 | 9.649 | 0.046 | 0.024 | 0.008 | 0.024 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.031 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 20 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
## | 0.757 | 8.087 | 0.303 | 0.222 | 0.173 | 0.092 | 0.049 | 0.016 | 0.049 | |
## | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.005 |
## | 0.000 | 0.032 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 21 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 |
## | 0.078 | 0.341 | 0.303 | 0.222 | 0.173 | 8.974 | 0.049 | 0.016 | 0.049 | |
## | 0.500 | 0.000 | 0.000 | 0.000 | 0.000 | 0.500 | 0.000 | 0.000 | 0.000 | 0.005 |
## | 0.007 | 0.000 | 0.000 | 0.000 | 0.000 | 0.059 | 0.000 | 0.000 | 0.000 | |
## | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 26 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.378 | 0.170 | 0.151 | 0.111 | 9.649 | 0.046 | 0.024 | 0.008 | 0.024 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.031 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 27 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
## | 0.378 | 0.170 | 0.151 | 7.135 | 0.086 | 0.046 | 0.024 | 0.008 | 0.024 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 |
## | 0.000 | 0.000 | 0.000 | 0.024 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 29 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.378 | 0.170 | 0.151 | 0.111 | 9.649 | 0.046 | 0.024 | 0.008 | 0.024 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.031 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 44 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
## | 0.378 | 0.170 | 0.151 | 7.135 | 0.086 | 0.046 | 0.024 | 0.008 | 0.024 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 |
## | 0.000 | 0.000 | 0.000 | 0.024 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 45 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
## | 1.021 | 0.170 | 0.151 | 0.111 | 0.086 | 0.046 | 0.024 | 0.008 | 0.024 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 |
## | 0.007 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 140 | 63 | 56 | 41 | 32 | 17 | 9 | 3 | 9 | 370 |
## | 0.378 | 0.170 | 0.151 | 0.111 | 0.086 | 0.046 | 0.024 | 0.008 | 0.024 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
For the reason variable, the two mostly seem independent. However, some cells show a higher chi-square contribution, such as reputation combined with a consumption level of 2.
library(gmodels)
CrossTable(alc$reason, alc$alc_use)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 370
##
##
## | alc$alc_use
## alc$reason | 1 | 1.5 | 2 | 2.5 | 3 | 3.5 | 4 | 4.5 | 5 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## course | 53 | 26 | 18 | 12 | 12 | 7 | 3 | 1 | 3 | 135 |
## | 0.072 | 0.395 | 0.290 | 0.585 | 0.009 | 0.102 | 0.025 | 0.008 | 0.025 | |
## | 0.393 | 0.193 | 0.133 | 0.089 | 0.089 | 0.052 | 0.022 | 0.007 | 0.022 | 0.365 |
## | 0.379 | 0.413 | 0.321 | 0.293 | 0.375 | 0.412 | 0.333 | 0.333 | 0.333 | |
## | 0.143 | 0.070 | 0.049 | 0.032 | 0.032 | 0.019 | 0.008 | 0.003 | 0.008 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## home | 39 | 15 | 13 | 15 | 11 | 4 | 2 | 1 | 3 | 103 |
## | 0.000 | 0.367 | 0.430 | 1.127 | 0.491 | 0.113 | 0.102 | 0.033 | 0.098 | |
## | 0.379 | 0.146 | 0.126 | 0.146 | 0.107 | 0.039 | 0.019 | 0.010 | 0.029 | 0.278 |
## | 0.279 | 0.238 | 0.232 | 0.366 | 0.344 | 0.235 | 0.222 | 0.333 | 0.333 | |
## | 0.105 | 0.041 | 0.035 | 0.041 | 0.030 | 0.011 | 0.005 | 0.003 | 0.008 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## other | 9 | 4 | 5 | 5 | 3 | 4 | 1 | 1 | 2 | 34 |
## | 1.161 | 0.553 | 0.004 | 0.403 | 0.001 | 3.804 | 0.036 | 1.903 | 1.664 | |
## | 0.265 | 0.118 | 0.147 | 0.147 | 0.088 | 0.118 | 0.029 | 0.029 | 0.059 | 0.092 |
## | 0.064 | 0.063 | 0.089 | 0.122 | 0.094 | 0.235 | 0.111 | 0.333 | 0.222 | |
## | 0.024 | 0.011 | 0.014 | 0.014 | 0.008 | 0.011 | 0.003 | 0.003 | 0.005 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## reputation | 39 | 18 | 20 | 9 | 6 | 2 | 3 | 0 | 1 | 98 |
## | 0.099 | 0.103 | 1.800 | 0.318 | 0.723 | 1.391 | 0.159 | 0.795 | 0.803 | |
## | 0.398 | 0.184 | 0.204 | 0.092 | 0.061 | 0.020 | 0.031 | 0.000 | 0.010 | 0.265 |
## | 0.279 | 0.286 | 0.357 | 0.220 | 0.188 | 0.118 | 0.333 | 0.000 | 0.111 | |
## | 0.105 | 0.049 | 0.054 | 0.024 | 0.016 | 0.005 | 0.008 | 0.000 | 0.003 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 140 | 63 | 56 | 41 | 32 | 17 | 9 | 3 | 9 | 370 |
## | 0.378 | 0.170 | 0.151 | 0.111 | 0.086 | 0.046 | 0.024 | 0.008 | 0.024 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
As expected, there does indeed seem to be some relationship between travel time and alcohol consumption. We see that as the travel time increases, the row totals decrease.
library(gmodels)
CrossTable(alc$traveltime, alc$alc_use)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 370
##
##
## | alc$alc_use
## alc$traveltime | 1 | 1.5 | 2 | 2.5 | 3 | 3.5 | 4 | 4.5 | 5 | Row Total |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 95 | 42 | 37 | 24 | 21 | 11 | 6 | 1 | 5 | 242 |
## | 0.129 | 0.015 | 0.004 | 0.296 | 0.000 | 0.001 | 0.002 | 0.472 | 0.134 | |
## | 0.393 | 0.174 | 0.153 | 0.099 | 0.087 | 0.045 | 0.025 | 0.004 | 0.021 | 0.654 |
## | 0.679 | 0.667 | 0.661 | 0.585 | 0.656 | 0.647 | 0.667 | 0.333 | 0.556 | |
## | 0.257 | 0.114 | 0.100 | 0.065 | 0.057 | 0.030 | 0.016 | 0.003 | 0.014 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 40 | 16 | 17 | 14 | 4 | 4 | 1 | 1 | 2 | 99 |
## | 0.172 | 0.044 | 0.271 | 0.837 | 2.431 | 0.066 | 0.823 | 0.048 | 0.069 | |
## | 0.404 | 0.162 | 0.172 | 0.141 | 0.040 | 0.040 | 0.010 | 0.010 | 0.020 | 0.268 |
## | 0.286 | 0.254 | 0.304 | 0.341 | 0.125 | 0.235 | 0.111 | 0.333 | 0.222 | |
## | 0.108 | 0.043 | 0.046 | 0.038 | 0.011 | 0.011 | 0.003 | 0.003 | 0.005 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 4 | 4 | 2 | 3 | 6 | 1 | 1 | 0 | 0 | 21 |
## | 1.960 | 0.050 | 0.437 | 0.195 | 9.638 | 0.001 | 0.468 | 0.170 | 0.511 | |
## | 0.190 | 0.190 | 0.095 | 0.143 | 0.286 | 0.048 | 0.048 | 0.000 | 0.000 | 0.057 |
## | 0.029 | 0.063 | 0.036 | 0.073 | 0.188 | 0.059 | 0.111 | 0.000 | 0.000 | |
## | 0.011 | 0.011 | 0.005 | 0.008 | 0.016 | 0.003 | 0.003 | 0.000 | 0.000 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 2 | 8 |
## | 1.357 | 0.096 | 1.211 | 0.886 | 0.137 | 1.088 | 3.333 | 13.482 | 16.750 | |
## | 0.125 | 0.125 | 0.000 | 0.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.250 | 0.022 |
## | 0.007 | 0.016 | 0.000 | 0.000 | 0.031 | 0.059 | 0.111 | 0.333 | 0.222 | |
## | 0.003 | 0.003 | 0.000 | 0.000 | 0.003 | 0.003 | 0.003 | 0.003 | 0.005 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 140 | 63 | 56 | 41 | 32 | 17 | 9 | 3 | 9 | 370 |
## | 0.378 | 0.170 | 0.151 | 0.111 | 0.086 | 0.046 | 0.024 | 0.008 | 0.024 | |
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
We now create a logistic regression model using our selected variables. The first thing to note is the z-statistics. Sex (with male as the dummy level) and absences seem to be the most relevant variables. I assume using only the male indicator is fine, as we saw mostly very similar values when analysing the sexes previously. They are followed by travel time, as we indicated before. It is interesting that some of the reason levels have a high p-value, meaning they are likely less useful for our model. However, among them the level "other" still seems to contribute something.
Regarding the odds ratios, a value of 1 would mean the two variables are independent. A ratio larger than 1 indicates a positive association, and one lower than 1 a negative association. For our purposes, I would argue that any value considerably far from 1 is useful for the model, since a strong indication either way helps guide the predictions. We see that sex indeed has an odds ratio well above 1, with even its lower confidence limit above 1, meaning a clear association with our target. Absences also show a consistent association, but the distance from 1 is much smaller, which was not as visible in the z-statistics. Travel time has a larger upper confidence limit, meaning that in some cases it gives us more information. The reason variable shows the most interesting behavior in my opinion: although some of its levels had notable z-statistics, the odds ratios are not particularly strong, and the confidence intervals show that the association is sometimes negative and sometimes positive depending on the level. I think this means they are not good predictive variables, as their relationship with the target changes with the value, which would, in my opinion, make learning the model more complicated.
# find the model with glm()
m <- glm(high_use ~ sex + absences + traveltime + reason, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + absences + traveltime + reason,
## family = "binomial", data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.50233 0.38742 -6.459 1.05e-10 ***
## sexM 0.97973 0.25027 3.915 9.05e-05 ***
## absences 0.10076 0.02457 4.101 4.11e-05 ***
## traveltime 0.41903 0.16736 2.504 0.0123 *
## reasonhome 0.12618 0.30888 0.409 0.6829
## reasonother 0.89956 0.41784 2.153 0.0313 *
## reasonreputation -0.36535 0.33099 -1.104 0.2697
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 401.49 on 363 degrees of freedom
## AIC: 415.49
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) sexM absences traveltime
## -2.5023266 0.9797277 0.1007637 0.4190261
## reasonhome reasonother reasonreputation
## 0.1261831 0.8995610 -0.3653526
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.08189424 0.0373175 0.1710846
## sexM 2.66373089 1.6404768 4.3849742
## absences 1.10601531 1.0565900 1.1635523
## traveltime 1.52048007 1.0957873 2.1182805
## reasonhome 1.13448993 0.6171224 2.0781000
## reasonother 2.45852348 1.0792890 5.6012936
## reasonreputation 0.69395194 0.3583216 1.3181019
With our model we can now make predictions. Here we treat a predicted probability larger than 0.5 as TRUE and anything lower as FALSE. The correct predictions are on the main diagonal of the table, and the wrong predictions on the off-diagonal. In total we had 89 wrong cases out of 370, giving a loss of about 0.24. If we simply guessed at random, one might expect around a 50% error rate, so by that standard we are doing pretty well.
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 6
## failures absences sex high_use probability prediction
## <dbl> <dbl> <chr> <lgl> <dbl> <lgl>
## 1 0 3 M FALSE 0.436 FALSE
## 2 1 0 M FALSE 0.554 TRUE
## 3 1 7 M TRUE 0.537 TRUE
## 4 0 1 F FALSE 0.340 FALSE
## 5 0 6 F FALSE 0.268 FALSE
## 6 1 2 F FALSE 0.132 FALSE
## 7 0 2 F FALSE 0.132 FALSE
## 8 0 3 F FALSE 0.204 FALSE
## 9 0 4 M TRUE 0.430 FALSE
## 10 0 2 M TRUE 0.484 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 248 11
## TRUE 78 33
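As a quick sanity check, the training error can also be computed directly from the confusion table above. A small sketch (the name `tab` is just a local helper of my own):

```r
# a sketch: training error computed directly from the confusion table
tab <- table(high_use = alc$high_use, prediction = alc$prediction)
# wrong predictions are the off-diagonal cells
wrong <- tab["FALSE", "TRUE"] + tab["TRUE", "FALSE"]
wrong / sum(tab)  # 89 / 370, about 0.24
```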
Since we noticed that the reason variable seems unreliable for the predictions, we try a new model without it. We actually get a slightly worse result, missing 91 values for a loss of about 0.246. This could be because at least some of its levels did contribute in some cases, so dropping the variable is a net loss.
m <- glm(high_use ~ sex + absences + traveltime, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 6
## failures absences sex high_use probability prediction
## <dbl> <dbl> <chr> <lgl> <dbl> <lgl>
## 1 0 3 M FALSE 0.417 FALSE
## 2 1 0 M FALSE 0.347 FALSE
## 3 1 7 M TRUE 0.515 TRUE
## 4 0 1 F FALSE 0.177 FALSE
## 5 0 6 F FALSE 0.345 FALSE
## 6 1 2 F FALSE 0.137 FALSE
## 7 0 2 F FALSE 0.137 FALSE
## 8 0 3 F FALSE 0.207 FALSE
## 9 0 4 M TRUE 0.441 FALSE
## 10 0 2 M TRUE 0.492 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 249 10
## TRUE 81 30
We now use cross-validation with K = 10 to check how good our model actually is. The idea is to train and test on different splits of the data, in order to see how well the model generalizes to out-of-training examples. We see that the error is a bit higher than on the training data, which is expected. However, our model still seems to slightly outperform the model suggested in the exercises, which had a loss of 0.26.
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2405405
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2540541
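Since the fold assignment in cv.glm() is random, the cross-validation estimate varies a little between runs. A small sketch (reusing the model `m` and `loss_func` defined above) that repeats the procedure a few times and averages, to make the estimate more stable:

```r
# repeat 10-fold cross-validation a few times and average the errors
# (a sketch; 5 repetitions is an arbitrary choice of mine)
errs <- replicate(5, cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)$delta[1])
mean(errs)
```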
Here we take a look at the data. This dataset has 506 rows and 14 columns, describing housing values in the suburbs of Boston. For example, some of the variables are the per capita crime rate by town (crim) and the average number of rooms per dwelling (rm).
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
We can use the pairs function to take a further look at the data and the relationships between the variables; it plots each variable against every other. We can also spot the discrete variables, for example "chas", which only takes the values 0 and 1, so the plot shows how the other variables relate to those two values. Some variables show a strong relationship, such as "lstat" and "medv". This makes sense, as the former is the percentage of lower-status population and the latter the median value of owner-occupied homes in $1000s. I think this plot also shows why we need to standardize the dataset: the variables have very different scales.
# draw a scatter plot matrix of the Boston dataset
pairs(Boston)
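As an alternative to pairs(), the ggpairs() function from the GGally package (installed at the start of this exercise set) adds density plots and correlation values to the plot matrix. A sketch on a subset of the variables, which I chose only to keep the plot readable:

```r
# a richer plot matrix for a few selected variables (sketch)
library(GGally)
ggpairs(Boston[, c("crim", "nox", "rm", "lstat", "medv")])
```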
The different scales can be seen better here. We see, for example, that "crim" ranges from about 0 to 89, whereas a variable such as "nox" stays between 0.385 and 0.871.
# print out summaries of the variables to compare their scales
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
We now use the scale() function to standardize the dataset. Although the variables do not end up in exactly the same range, they all have the same magnitude. The idea of standardizing is to center each variable around 0 with a standard deviation of 1, which we can verify by summarizing the data again.
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
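As a quick sanity check, here is a sketch showing that scale() with its defaults is just subtracting the column mean and dividing by the column standard deviation:

```r
library(MASS)
data("Boston")
# scale() with defaults: (x - mean(x)) / sd(x), column by column
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
auto <- as.data.frame(scale(Boston))$crim
all.equal(manual, auto)  # TRUE
```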
By using the quantiles as break points we can discretize the crim variable, turning it from numeric into a categorical crime variable. Looking at the summary again, we can see the new crime variable as the last column.
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 [-0.419,-0.411]:127
## 1st Qu.:-0.5989 (-0.411,-0.39] :126
## Median :-0.1449 (-0.39,0.00739]:126
## Mean : 0.0000 (0.00739,9.92] :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
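As a side note, cut() also accepts human-readable labels instead of the interval notation shown above. A small sketch (the label names are my own choice, matching the level names used in the ready-made boston_scaled dataset later in this diary):

```r
library(MASS)
data("Boston")
scaled_crim <- as.data.frame(scale(Boston))$crim
# quantile() returns 5 break points, so cut() produces 4 intervals
bins <- quantile(scaled_crim)
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)
```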
It is common to split the data into a training set and a validation set. We now use 80% of the data for training and 20% for validation. Since we want to learn to predict the crime variable, we remove it from the test set, keeping the correct classes aside for later comparison.
n <- nrow(Boston)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
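The split above is random, so the exact numbers below will vary between runs. A sketch of how I would make it reproducible (the seed value is arbitrary):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(2017)  # arbitrary seed, only for reproducibility
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
# sanity check: the two parts partition the data
nrow(train) + nrow(test) == n  # TRUE
```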
Linear discriminant analysis (LDA) is more commonly used for dimensionality reduction. The idea is to find a linear combination of features that describes or differentiates two or more classes of target objects or events. However, we can also use it as a linear classifier, which is what we do below.
We also cross-tabulate the correct classes against the predictions. The diagonal from top left to bottom right represents the correct predictions. We can easily see that the majority of the predictions were correct.
lda.fit <- lda(crime ~ ., data = train)
predictions <- predict(lda.fit, newdata = test)
library(gmodels)
CrossTable(correct_classes, predictions$class)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 102
##
##
## | predictions$class
## correct_classes | [-0.419,-0.411] | (-0.411,-0.39] | (-0.39,0.00739] | (0.00739,9.92] | Row Total |
## ----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
## [-0.419,-0.411] | 13 | 11 | 1 | 0 | 25 |
## | 21.017 | 2.131 | 5.730 | 6.373 | |
## | 0.520 | 0.440 | 0.040 | 0.000 | 0.245 |
## | 0.812 | 0.379 | 0.032 | 0.000 | |
## | 0.127 | 0.108 | 0.010 | 0.000 | |
## ----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
## (-0.411,-0.39] | 2 | 14 | 11 | 0 | 27 |
## | 1.180 | 5.209 | 0.951 | 6.882 | |
## | 0.074 | 0.519 | 0.407 | 0.000 | 0.265 |
## | 0.125 | 0.483 | 0.355 | 0.000 | |
## | 0.020 | 0.137 | 0.108 | 0.000 | |
## ----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
## (-0.39,0.00739] | 1 | 4 | 19 | 3 | 27 |
## | 2.471 | 1.761 | 14.199 | 2.190 | |
## | 0.037 | 0.148 | 0.704 | 0.111 | 0.265 |
## | 0.062 | 0.138 | 0.613 | 0.115 | |
## | 0.010 | 0.039 | 0.186 | 0.029 | |
## ----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
## (0.00739,9.92] | 0 | 0 | 0 | 23 | 23 |
## | 3.608 | 6.539 | 6.990 | 50.094 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.225 |
## | 0.000 | 0.000 | 0.000 | 0.885 | |
## | 0.000 | 0.000 | 0.000 | 0.225 | |
## ----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
## Column Total | 16 | 29 | 31 | 26 | 102 |
## | 0.157 | 0.284 | 0.304 | 0.255 | |
## ----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
##
##
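From the cross table one can also compute a single overall accuracy number: the fraction of observations on the diagonal. A self-contained sketch of the whole pipeline (the seed is my own choice, so the number will differ from the table above):

```r
library(MASS)
set.seed(2017)  # arbitrary seed so the result is reproducible
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
boston_scaled$crim <- NULL
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
fit <- lda(crime ~ ., data = train)
pred <- predict(fit, newdata = test)$class
# overall accuracy: fraction of correctly classified test observations
acc <- mean(pred == test$crime)
acc
```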
Instead of using crime as a target, we can try to cluster the data using k-means. For that, we will use the distances between points, which can be computed with many different distance metrics. For our purposes I believe the Euclidean distance will suffice.
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
dist <- dist(boston_scaled)
summary(dist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
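Regarding the choice of metric: dist() supports several alternatives through its method argument, with "euclidean" as the default. A sketch comparing it with the Manhattan distance:

```r
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
# "euclidean" is the default metric of dist()
d_euc <- dist(boston_scaled, method = "euclidean")
d_man <- dist(boston_scaled, method = "manhattan")
summary(as.vector(d_man))
```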
K-means is a clustering method: it partitions the data into k different clusters. Since this method minimizes the within-cluster variance, we see why I chose to look at the Euclidean distances. That is, k-means minimizes the squared Euclidean distances.
I also plot the results for a reduced set of variables. We see that maybe 3 clusters was too many, as the vast majority of points fall into one cluster (red) or the other (black), while almost no points fall into the third cluster (green).
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston[c("rm", "age", "dis", "crim")], col = km$cluster)
For that reason I redo the k-means using only 2 centers. We now see a clear separation between the clusters, without much overlap. This is good, as one of the issues of k-means is that it can bleed points from one cluster into another due to its tendency to create clusters of similar sizes.
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston[c("rm", "age", "dis", "crim")], col = km$cluster)
We now perform LDA using the clusters as targets instead of the crime variable. Since we will need a proper clustering for our predictions, I do a small search over different numbers of clusters, with a set seed to remove the randomness of selecting the initial centers.
library(ggplot2)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
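As a side note: the search above runs on the unscaled Boston data, while the clustering below uses the scaled version. A sketch of the same search on the scaled data, which I would expect to be the more consistent comparison:

```r
library(MASS)
library(ggplot2)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
set.seed(123)
k_max <- 10
# total within-cluster sum of squares for k = 1..10, on the *scaled* data
twcss_scaled <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss_scaled, geom = 'line')
```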
This metric (the total within-cluster sum of squares) can give us an indication of the number of clusters to use. We see a drastic drop around 3 clusters, so this is the value we will use next. It is important to note that this goes against my previous experience. However, now that we set the seed, our k-means should behave similarly to this last run.
library(MASS)
set.seed(123)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
km <- kmeans(boston_scaled, centers = 3)
boston_scaled$cluster <- km$cluster
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$cluster
test <- dplyr::select(test, -cluster)
lda.fit = lda(cluster ~ ., data=train)
With our new model, we can once again predict and inspect a table of results. And it looks good! The vast majority of points have been correctly predicted. We can also see what I mentioned before: the sizes of the clusters are remarkably similar, all of them around 30 points.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct 1 2 3
## 1 30 5 0
## 2 0 31 5
## 3 0 2 29
We can also plot our results using arrows to represent the relationship of the original variables to the LDA solution. We see that both “rad” and “age” seem to be the major variables in this new space. Naturally the other variables have some effect, for example “tax”, but these two seem to be the most influential. This is supported both by the lengths of their arrows (meaning a larger contribution) and by the angle between them: at nearly 90 degrees they are almost uncorrelated, which is ideal for spanning a vector space.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$cluster)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 2)
We can now merge our different approaches and compare them. Both plots are in fact quite similar. However, I would argue that the k-means one is better: in cluster 1 (purple) all points sit together in the left part of the plot, whereas in the previous plot we had some mixing of different values there. It is important to note, however, that I used our chosen value of 3 clusters, while the previous coloring had more classes, so this difference might actually be quite minimal. Especially since the points colored there belonged to the next closest classes (high and med high).
##################################
library(MASS)
boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt",sep=",", header = T)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.fit = lda(crime ~ ., data=train)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# And now with k-means
km <- kmeans(train %>% dplyr::select(-crime), centers = 3)
train$cluster <- km$cluster
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$cluster)
As usual, even though I finished the data wrangling part, I would rather use the dataset provided by the lecturer.
library(readr)
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Our first step is to move the country column to rownames.
# move the country names to rownames
library(tibble)
human <- column_to_rownames(human, "Country")
We can then take a closer look at the data. From the previous step, each country is now a row name, and each column holds one variable measured for that country.
We can also notice that some variables have much larger magnitudes than others. This can make further analysis difficult, so later on we will scale the data to solve this.
head(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp GNI Mat.Mor Ado.Birth
## Norway 1.0072389 0.8908297 81.6 17.5 64992 4 7.8
## Australia 0.9968288 0.8189415 82.4 20.2 42261 6 12.1
## Switzerland 0.9834369 0.8251001 83.0 15.8 56431 6 1.9
## Denmark 0.9886128 0.8840361 80.2 18.7 44025 5 5.1
## Netherlands 0.9690608 0.8286119 81.6 17.9 45435 6 6.2
## Germany 0.9927835 0.8072289 80.9 16.5 43919 7 3.8
## Parli.F
## Norway 39.6
## Australia 30.5
## Switzerland 28.5
## Denmark 38.0
## Netherlands 36.9
## Germany 36.9
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
I personally enjoy the ggpairs function for getting an idea of the data. On the diagonal we see the distribution of each variable. Below the diagonal we see pairwise scatter plots of the variables, and above it their correlations.
library(GGally)
ggpairs(human, progress = FALSE)
We now apply PCA to the data without any further preprocessing. PCA linearly transforms the data into a new coordinate system, which can be used for dimensionality reduction, for example in machine learning tasks. As expected, GNI overwhelms all other variables in the representation due to its much larger magnitude.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
However, we can redo the previous analysis on a scaled version of the data. We now see a much more distributed spread of vector magnitudes for the different variables.
For example, we see that variables which have a close relationship in real life end up with similar vectors. Look at GNI and female secondary education: it is well known that wealthier countries also have higher levels of education. Interestingly, this relationship is not obvious from the pairwise correlation of these two variables alone. Other related variables are life expectancy and expected years of education. On the other side we have adolescent births and maternal mortality; these two variables have similar vectors, as they have a strong statistical interaction. The final two vectors relate to female participation in parliament and in the work force.
From that, my interpretation of these component dimensions is country wealth for PC1 and female participation (power) both in politics and economically for PC2.
library(dplyr)
human_std <- scale(human)
human_std2 <- human %>% rename(`Female secondary education` = Edu2.FM, `Female in the work force` = Labo.FM,
`Life expectancy` = Life.Exp, `Expected education` = Edu.Exp, `Gross income per capita` = GNI,
`Maternal mortality` = Mat.Mor, `Adolescent birth` = Ado.Birth, `Female participation in parliament` = Parli.F) %>% scale
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# draw a biplot of the principal component representation and the original variables
pca_human2 <- prcomp(human_std2)
biplot(pca_human2, choices = 1:2, cex = c(0.1, 0.7), col = c("grey40", "deeppink2"))
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
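The share of the variance captured by each component can be checked numerically with summary() on the prcomp object. A sketch, rebuilding the scaled PCA from the same source file:

```r
library(readr)
library(tibble)
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv", show_col_types = FALSE)
human <- column_to_rownames(human, "Country")
pca_human <- prcomp(scale(human))
# proportion of variance explained by each principal component
round(summary(pca_human)$importance["Proportion of Variance", ], 3)
```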
For this next part we will make use of the “tea” dataset, taking a basic look at the data using summary and view. It is a fun dataset, covering different types of tea, how they are consumed, with or without sugar, where they were bought, and whether the tea is drunk at lunch.
library(FactoMineR)
tea_time <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea_time.csv", stringsAsFactors = TRUE)
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
view(tea_time)
Multiple correspondence analysis (MCA) represents the data in a low-dimensional space. It can be seen as similar to PCA, but it can be applied to categorical data.
In the factor map, we can interpret the distance between points as a measure of their similarity. For example, tea bags appear close to chain stores, whereas unpackaged tea appears close to tea shops. We see a similar pattern in the biplot.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic",habillage = "quali")
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_mca_biplot(mca)
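The eigenvalue table shown in the summary can also be pulled out of the model object directly, which is handy for scripting. A small sketch:

```r
library(FactoMineR)
tea_time <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea_time.csv", stringsAsFactors = TRUE)
mca <- MCA(tea_time, graph = FALSE)
# eigenvalue, percentage of variance and cumulative percentage per dimension
head(mca$eig, 3)
```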
We start by creating the data from the data wrangling step.
library(readr)
library(dplyr)
bprs <- read_delim("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", delim = " ")
## Rows: 40 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## dbl (11): treatment, subject, week0, week1, week2, week3, week4, week5, week...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rats <- read_delim("data/rats.txt", delim = "\t")
## Rows: 16 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (13): ID, Group, WD1, WD8, WD15, WD22, WD29, WD36, WD43, WD44, WD50, WD5...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(bprs)
## [1] 40 11
dim(rats)
## [1] 16 13
summary(bprs)
## treatment subject week0 week1 week2
## Min. :1.0 Min. : 1.00 Min. :24.00 Min. :23.00 Min. :26.0
## 1st Qu.:1.0 1st Qu.: 5.75 1st Qu.:38.00 1st Qu.:35.00 1st Qu.:32.0
## Median :1.5 Median :10.50 Median :46.00 Median :41.00 Median :38.0
## Mean :1.5 Mean :10.50 Mean :48.00 Mean :46.33 Mean :41.7
## 3rd Qu.:2.0 3rd Qu.:15.25 3rd Qu.:58.25 3rd Qu.:54.25 3rd Qu.:49.0
## Max. :2.0 Max. :20.00 Max. :78.00 Max. :95.00 Max. :75.0
## week3 week4 week5 week6 week7
## Min. :24.00 Min. :20.00 Min. :20.00 Min. :19.00 Min. :18.0
## 1st Qu.:29.75 1st Qu.:28.00 1st Qu.:26.00 1st Qu.:22.75 1st Qu.:23.0
## Median :36.50 Median :34.50 Median :30.50 Median :28.50 Median :30.0
## Mean :39.15 Mean :36.35 Mean :32.55 Mean :31.23 Mean :32.2
## 3rd Qu.:44.50 3rd Qu.:43.00 3rd Qu.:38.00 3rd Qu.:37.00 3rd Qu.:38.0
## Max. :76.00 Max. :66.00 Max. :64.00 Max. :64.00 Max. :62.0
## week8
## Min. :20.00
## 1st Qu.:22.75
## Median :28.00
## Mean :31.43
## 3rd Qu.:35.25
## Max. :75.00
summary(rats)
## ID Group WD1 WD8 WD15
## Min. : 1.00 Min. :1.00 Min. :225.0 Min. :230.0 Min. :230.0
## 1st Qu.: 4.75 1st Qu.:1.00 1st Qu.:252.5 1st Qu.:255.0 1st Qu.:255.0
## Median : 8.50 Median :1.50 Median :340.0 Median :345.0 Median :347.5
## Mean : 8.50 Mean :1.75 Mean :365.9 Mean :369.1 Mean :372.5
## 3rd Qu.:12.25 3rd Qu.:2.25 3rd Qu.:480.0 3rd Qu.:476.2 3rd Qu.:486.2
## Max. :16.00 Max. :3.00 Max. :555.0 Max. :560.0 Max. :565.0
## WD22 WD29 WD36 WD43
## Min. :232.0 Min. :240.0 Min. :240.0 Min. :243.0
## 1st Qu.:267.2 1st Qu.:268.8 1st Qu.:267.2 1st Qu.:269.2
## Median :351.5 Median :356.5 Median :360.0 Median :360.0
## Mean :379.2 Mean :383.9 Mean :387.0 Mean :386.0
## 3rd Qu.:492.5 3rd Qu.:497.8 3rd Qu.:504.2 3rd Qu.:501.0
## Max. :580.0 Max. :590.0 Max. :597.0 Max. :595.0
## WD44 WD50 WD57 WD64
## Min. :244.0 Min. :238.0 Min. :247.0 Min. :245.0
## 1st Qu.:270.0 1st Qu.:273.8 1st Qu.:273.8 1st Qu.:278.0
## Median :362.0 Median :370.0 Median :373.5 Median :378.0
## Mean :388.3 Mean :394.6 Mean :398.6 Mean :404.1
## 3rd Qu.:510.5 3rd Qu.:516.0 3rd Qu.:524.5 3rd Qu.:530.8
## Max. :595.0 Max. :612.0 Max. :618.0 Max. :628.0
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
bprs_long <- melt(setDT(bprs), id.vars = c("treatment","subject"), variable.name = "week")
rats_long <- melt(setDT(rats), id.vars = c("ID","Group"), variable.name = "time")
summary(bprs_long)
## treatment subject week value
## Min. :1.0 Min. : 1.00 week0 : 40 Min. :18.00
## 1st Qu.:1.0 1st Qu.: 5.75 week1 : 40 1st Qu.:27.00
## Median :1.5 Median :10.50 week2 : 40 Median :35.00
## Mean :1.5 Mean :10.50 week3 : 40 Mean :37.66
## 3rd Qu.:2.0 3rd Qu.:15.25 week4 : 40 3rd Qu.:43.00
## Max. :2.0 Max. :20.00 week5 : 40 Max. :95.00
## (Other):120
summary(rats_long)
## ID Group time value
## Min. : 1.00 Min. :1.00 WD1 :16 Min. :225.0
## 1st Qu.: 4.75 1st Qu.:1.00 WD8 :16 1st Qu.:267.0
## Median : 8.50 Median :1.50 WD15 :16 Median :344.5
## Mean : 8.50 Mean :1.75 WD22 :16 Mean :384.5
## 3rd Qu.:12.25 3rd Qu.:2.25 WD29 :16 3rd Qu.:511.2
## Max. :16.00 Max. :3.00 WD36 :16 Max. :628.0
## (Other):80
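The same wide-to-long conversion can also be written with tidyr, which is perhaps more common in tidyverse-style code than data.table's melt. A sketch for the BPRS data, using the same download URL as above:

```r
library(readr)
library(tidyr)
bprs <- read_delim("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", delim = " ", show_col_types = FALSE)
# wide -> long: one row per (treatment, subject, week)
bprs_long <- pivot_longer(bprs, cols = starts_with("week"),
                          names_to = "week", values_to = "value")
dim(bprs_long)  # 40 subjects x 9 weeks = 360 rows, 4 columns
```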
This dataset is about the weights of rats over a period of several weeks. We can get a better idea of the values (weights) by creating a summary per time point.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ purrr 1.0.2
## ✔ lubridate 1.9.3 ✔ stringr 1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ data.table::between() masks dplyr::between()
## ✖ plotly::filter() masks dplyr::filter(), stats::filter()
## ✖ data.table::first() masks dplyr::first()
## ✖ lubridate::hour() masks data.table::hour()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag() masks stats::lag()
## ✖ data.table::last() masks dplyr::last()
## ✖ lubridate::mday() masks data.table::mday()
## ✖ lubridate::minute() masks data.table::minute()
## ✖ lubridate::month() masks data.table::month()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ lubridate::second() masks data.table::second()
## ✖ plotly::select() masks MASS::select(), dplyr::select()
## ✖ purrr::transpose() masks data.table::transpose()
## ✖ lubridate::wday() masks data.table::wday()
## ✖ lubridate::week() masks data.table::week()
## ✖ lubridate::yday() masks data.table::yday()
## ✖ lubridate::year() masks data.table::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rstatix)
##
## Attaching package: 'rstatix'
##
## The following object is masked from 'package:MASS':
##
## select
##
## The following object is masked from 'package:stats':
##
## filter
group_by(rats_long, time) %>%
get_summary_stats(value)
## # A tibble: 11 × 14
## time variable n min max median q1 q3 iqr mad mean sd
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 WD1 value 16 225 555 340 252. 480 228. 145. 366. 126.
## 2 WD8 value 16 230 560 345 255 476. 221. 141. 369. 124.
## 3 WD15 value 16 230 565 348. 255 486. 231. 141. 372. 127.
## 4 WD22 value 16 232 580 352. 267. 492. 225. 139. 379. 127.
## 5 WD29 value 16 240 590 356. 269. 498. 229 140. 384. 129.
## 6 WD36 value 16 240 597 360 267. 504. 237 145. 387 132.
## 7 WD43 value 16 243 595 360 269. 501 232. 140. 386 128.
## 8 WD44 value 16 244 595 362 270 510. 240. 142. 388. 130.
## 9 WD50 value 16 238 612 370 274. 516 242. 156. 395. 135.
## 10 WD57 value 16 247 618 374. 274. 524. 251. 153. 399. 136.
## 11 WD64 value 16 245 628 378 278 531. 253. 157. 404. 140.
## # ℹ 2 more variables: se <dbl>, ci <dbl>
First we create boxplots with time on the x-axis and value on the y-axis, i.e. the measured values of each individual over time. We can see a clear increase in the median value over time, which makes sense, as the value here is in fact the weight of the rats. We also know that there were 3 groups given different diets; this can also be seen in the clusters of black points.
library(ggplot2)
ggplot(rats_long, aes(time, value, fill = time)) +
geom_boxplot() +
geom_jitter(width = 0.2) +
guides(fill = "none")
We can see the behavior of the different groups more clearly by plotting the mean value per group over time.
aggregated_rats <-
rats_long %>%
group_by(Group, time) %>%
summarize(mean = mean(value,na.rm=TRUE))
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(aggregated_rats, aes(time, mean, col = Group)) +
geom_point()
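A variant I might try next is adding standard errors to the group means. Since the real rats file is read from a local path above, this sketch uses a small made-up stand-in dataset purely to illustrate the plotting code (none of the numbers are real measurements):

```r
library(dplyr)
library(ggplot2)
# made-up stand-in for the long-form rats data, for illustration only
toy <- expand.grid(ID = 1:6, day = c(1, 8, 15))
toy$Group <- ifelse(toy$ID <= 3, 1, 2)
set.seed(1)  # arbitrary seed
toy$value <- 300 + 2 * toy$day + rnorm(nrow(toy), sd = 5)
# mean and standard error of the mean per group and time point
agg <- toy %>%
  group_by(Group, day) %>%
  summarize(mean = mean(value), se = sd(value) / sqrt(n()), .groups = "drop")
ggplot(agg, aes(day, mean, col = factor(Group))) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.5)
```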
In this dataset we have data about 40 males who were subjected to two different treatments. The main information is the assessment of 18 symptoms and their severity. So over time, if the treatments worked, we should see a decline in the severity of symptoms.
From the summary we can already see that, on average, there was a strong reduction in the total symptom intensity.
library(tidyverse)
library(rstatix)
group_by(bprs_long, week) %>%
get_summary_stats(value)
## # A tibble: 9 × 14
## week variable n min max median q1 q3 iqr mad mean sd
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 week0 value 40 24 78 46 38 58.2 20.2 14.1 48 14.1
## 2 week1 value 40 23 95 41 35 54.2 19.2 12.6 46.3 16.3
## 3 week2 value 40 26 75 38 32 49 17 10.4 41.7 12.5
## 4 week3 value 40 24 76 36.5 29.8 44.5 14.8 10.4 39.2 12.5
## 5 week4 value 40 20 66 34.5 28 43 15 11.1 36.4 11.0
## 6 week5 value 40 20 64 30.5 26 38 12 8.15 32.6 9.25
## 7 week6 value 40 19 64 28.5 22.8 37 14.2 11.1 31.2 10.2
## 8 week7 value 40 18 62 30 23 38 15 10.4 32.2 11.5
## 9 week8 value 40 20 75 28 22.8 35.2 12.5 8.90 31.4 12.3
## # ℹ 2 more variables: se <dbl>, ci <dbl>